12/01/2020

Agenda

  • Goals
  • Simple Neural Networks
  • Deep Neural Networks
  • Towards real data applications

Goals

Introduction

  • Neural Networks (NN) are flexible models for supervised learning:
    • Regression
    • Classification
  • NN can approximate any continuous function (on a compact domain) arbitrarily well
  • Historically, NN were inspired by modeling biological neural networks

Introduction

  • The human nervous system has roughly 86 billion neurons, connected by approximately \(10^{14}\) - \(10^{15}\) synapses
  • A neuron receives input signals through its dendrites and produces output signals along its axon
  • The axon eventually branches out and connects via synapses to the dendrites of other neurons

How it works

  • \(x_0,x_1,x_2\): input signals
  • \(w_0, w_1, w_2\): weights or synaptic strengths
  • If the weighted sum of the input signals is above a certain threshold, the neuron fires and sends a spike along its axon
  • \(f(\cdot)\): activation function, which outputs the firing frequency of the neuron (see the sketch below)
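A minimal sketch of this computation in NumPy (the bias term b and the choice of a sigmoid for \(f\) are illustrative assumptions):

```python
import numpy as np

def neuron_output(x, w, b):
    """Weighted sum of the input signals plus bias, passed through a sigmoid activation."""
    z = np.dot(w, x) + b             # weighted sum of the input signals
    return 1.0 / (1.0 + np.exp(-z))  # "firing rate" between 0 and 1

# Example: three input signals x0, x1, x2 with weights w0, w1, w2
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.2])
print(neuron_output(x, w, b=0.0))
```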

Simple Neural Networks


Activation functions

  • Sigmoid: it maps real-valued input to a range between 0 and 1
  • TanH: it maps real-valued input to a range between -1 and 1
  • ReLU (Rectified Linear Unit): it takes a real-valued input and thresholds it at zero (replaces negative values with zero)
  • We only consider nonlinear activation functions, because linear activations degenerate: a composition of linear transformations is still a linear transformation (the three activations are sketched below)
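For reference, a small NumPy sketch of the three activations (the input vector is made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))   # maps to (0, 1)

def tanh(z):
    return np.tanh(z)                 # maps to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)         # replaces negative values with zero

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```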

NN training

  • Regression: find weights (parameters) which minimize the squared error loss \[\widehat{\mathbf{w}} = \text{argmin}_{\mathbf{w}} \dfrac{1}{2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\] where \(\hat{y}_i\) is the output from the NN with input \(x_i\) and weights \(\mathbf{w}\)
  • Binary classification: find weights which minimize the cross-entropy loss \[\widehat{\mathbf{w}} = \text{argmin}_{\mathbf{w}} \sum_{i=1}^{n} \left\{ - y_i \log (\hat{y}_i) - (1 - y_i) \log (1 - \hat{y}_i) \right\}\] where \(0 \leq \hat{y}_i \leq 1\) is the output from the NN with a sigmoid activation function in the last layer
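The two loss functions above, written out in NumPy (the small eps clipping is a numerical safeguard, not part of the definition):

```python
import numpy as np

def squared_error_loss(y, y_hat):
    # Regression: one half of the sum of squared residuals
    return 0.5 * np.sum((y - y_hat) ** 2)

def cross_entropy_loss(y, y_hat, eps=1e-12):
    # Binary classification: y in {0, 1}, y_hat in (0, 1)
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return np.sum(-y * np.log(y_hat) - (1 - y) * np.log(1 - y_hat))
```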

NN training

How to minimize these loss functions? We use gradient descent (via back-propagation) to find \(\widehat{\mathbf{w}}\)!

  • Forward-propagation: given inputs and weights, the outputs are determined by following the network
  • Backward-propagation: training the NN (i.e. estimating the weights) works in reverse, propagating derivatives of the error from the outputs back to the inputs (a toy sketch follows below)
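To make the idea concrete, here is a toy (full-batch) gradient descent sketch for a one-parameter regression; the data and learning rate are made up for illustration:

```python
import numpy as np

# Loss: L(w) = 1/2 * sum_i (y_i - w * x_i)^2, with gradient dL/dw = -sum_i (y_i - w * x_i) * x_i
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 2.1, 3.9, 6.2])

w, lr = 0.0, 0.01
for step in range(200):
    y_hat = w * x                      # forward pass
    grad = -np.sum((y - y_hat) * x)    # backward pass: derivative of the loss w.r.t. w
    w -= lr * grad                     # gradient descent update
print(w)                               # converges to the least-squares slope
```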

For those who are interested: stochastic gradient descent, mini-batches, the Adam algorithm

Coding complex neural networks from scratch can be challenging; thankfully, there are existing frameworks that do it for us. Check out https://www.tensorflow.org/
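For example, a minimal Keras sketch of a feed-forward NN for binary classification (the layer sizes and the assumption of 10 input features are arbitrary choices for illustration):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),                     # 10 input features (assumed)
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 1
    tf.keras.layers.Dense(16, activation="relu"),    # hidden layer 2
    tf.keras.layers.Dense(1, activation="sigmoid"),  # sigmoid output for cross-entropy
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, batch_size=32)  # with your own data
```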

Gradient descent

Deep Neural Networks

Deeper Neural Networks

Number of layers and hidden units

  • More layers and hidden units increase the flexibility of the NN, but also make it more prone to overfitting
  • Rule of thumb: choose 2 or 3 hidden layers with a moderate to large number of hidden units, and use regularization (e.g. \(\ell_2\) regularization, dropout) to prevent overfitting

\(\ell_2\) regularization

  • Add \(\frac{1}{2} \lambda w^2\) to the loss/objective function for each weight \(w\), where \(\lambda\) is a tuning parameter that controls the strength of the regularization
  • In the example above, each neural network has 20 hidden units; a higher regularization strength makes the final decision regions smoother
  • In practice, \(\lambda\) is chosen using cross-validation
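A sketch of how the penalty enters the objective (the weights argument is a hypothetical list of weight arrays, one per layer):

```python
import numpy as np

def l2_penalty(weights, lam):
    # Add 0.5 * lambda * w^2 for every weight w in the network
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

# total objective = data loss + l2_penalty(weights, lam)
```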

Dropout

  • Dropout randomly sets a fraction of the hidden units to zero at each training update, which discourages co-adaptation of units and reduces overfitting
  • In practice, the dropout rate is often chosen to be \(50\%\)
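In Keras, dropout is added as its own layer; a sketch with a 50% rate (layer sizes are again arbitrary):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),    # randomly zero 50% of the hidden units during training
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```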

Practical Issues

  • Data preprocessing
    • Normalization: standardization, or scaling each variable to \((-1, 1)\)
    • Whitening: principal component analysis
  • Weight initialization
    • Do not set all the initial weights to zero
    • Instead, for a ReLU neuron with \(m\) inputs, draw \(w_i \sim \mathcal{N}(0, 2/m)\) (He initialization; see the sketch below)
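A sketch of this initialization in NumPy:

```python
import numpy as np

def he_init(m, n_units, seed=0):
    # Each weight is drawn from N(0, 2/m), i.e. standard deviation sqrt(2/m),
    # where m is the number of inputs feeding the ReLU neuron.
    rng = np.random.default_rng(seed)
    return rng.normal(loc=0.0, scale=np.sqrt(2.0 / m), size=(n_units, m))

W = he_init(m=100, n_units=20)   # weights for 20 hidden units, each with 100 inputs
```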

Towards real data applications

Convolutional Neural Networks


Why not use regular NN?

  • A 2D color image is a 3-dimensional array (width, height, depth) of pixel values, where the depth corresponds to the 3 color channels
  • A greyscale image is a matrix (width, height) of pixel values
  • Even a small \(200 \times 200\) color image leads to \(200 \times 200 \times 3 = 120{,}000\) weights for each hidden unit
  • We may need hundreds of hidden units
  • The full connectivity of a regular NN is wasteful and the huge number of parameters would quickly lead to overfitting

Convolutional Neural Networks (CNN)

  • A Convolutional Neural Network (CNN or ConvNet) is a NN specifically designed for image inputs
  • Very popular in computer vision and image analysis
  • The most common task is to classify images

1. Convolution


We slide the orange matrix (the filter) over the original image (green) one pixel at a time (this step size is called the stride); at every position we compute the element-wise product with the covered patch and sum the results to obtain the corresponding element of the output matrix (pink), called the feature map.
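A direct NumPy implementation of this sliding operation (the example image and filter values are made up for illustration):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    # Slide the kernel over the image; at each position take the
    # element-wise product with the covered patch and sum it.
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

image = np.array([[1, 1, 1, 0, 0],
                  [0, 1, 1, 1, 0],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 0],
                  [0, 1, 1, 0, 0]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])
print(convolve2d(image, kernel))   # a 3x3 feature map
```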

2. ReLU

Apply ReLU element-wise to the feature map to introduce non-linearity.
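In code this is a one-liner (the feature map values here are hypothetical):

```python
import numpy as np

feature_map = np.array([[4.0, -2.0],
                        [-1.0, 3.0]])    # hypothetical convolution output
print(np.maximum(0.0, feature_map))      # negative values replaced by zero
```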

3. Pooling

Max pooling progressively reduces the spatial size of each feature map while keeping the most important information. It reduces the number of parameters and the amount of computation in the network, and hence also helps control overfitting.
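A sketch of \(2 \times 2\) max pooling with stride 2 on a small, made-up feature map:

```python
import numpy as np

def max_pool(feature_map, size=2, stride=2):
    # Keep only the largest value in each (size x size) window.
    h = (feature_map.shape[0] - size) // stride + 1
    w = (feature_map.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = feature_map[i * stride:i * stride + size,
                                    j * stride:j * stride + size].max()
    return out

fm = np.array([[1.0, 3.0, 2.0, 1.0],
               [4.0, 6.0, 5.0, 0.0],
               [2.0, 1.0, 9.0, 8.0],
               [0.0, 3.0, 4.0, 7.0]])
print(max_pool(fm))   # [[6. 5.] [3. 9.]]
```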

4. Fully Connected Layer

The outputs from the convolutional and pooling layers represent high-level features of the input image. The purpose of the fully connected layer is to use these features to classify the input image into the various classes present in the training dataset.
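Putting the four pieces together, a minimal Keras CNN in the convolution, ReLU, pooling, fully connected pattern (the \(32 \times 32\) RGB input size, the number of filters, and the 10 output classes are all illustrative assumptions):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),                       # small RGB images (assumed)
    tf.keras.layers.Conv2D(16, (3, 3), activation="relu"),   # convolution + ReLU
    tf.keras.layers.MaxPooling2D((2, 2)),                    # max pooling
    tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(64, activation="relu"),             # fully connected layer
    tf.keras.layers.Dense(10, activation="softmax"),          # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```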

Summary

Conclusions

  • DNNs are very popular these days. They seem to work best on highly non-linear but low-noise problems (think of images); it is unclear how successful they are in high-noise social science/economics applications
  • Machine Learning vs Statistical Learning
  • Quantification of uncertainty in Neural Networks is an active area of research

e-CIS

Please fill out the anonymous electronic course evaluation. Feel free to leave your feedback, the course can always improve thanks to students’ input!

Question time